The abundances reported by an affinity purification experiment can be used to infer whether two proteins interact. The data take the form of an abundance value for every protein involved, measured across a number of experiments. The more abundant a protein is, the more likely it is to be observed interacting with other proteins.
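As a rough illustration only (this is not the ocbio.extract implementation), a pairwise feature could be built from per-protein abundances by combining the two proteins' values, for example with a geometric mean. The Entrez IDs and numbers below are made up:

import math

# hypothetical per-protein abundances keyed by Entrez ID
abundance = {"2597": 1500.0, "7415": 320.0}

def pair_abundance(a, b, table):
    # combine two abundances into one pairwise value;
    # return 0.0 if either protein was not measured
    if a not in table or b not in table:
        return 0.0
    return math.sqrt(table[a] * table[b])

print pair_abundance("2597", "7415", abundance)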
Missing values will be treated as they were for the affinity feature extraction. This means we must pickle an average value for this feature, to be used in place of missing values on the full training set.
In [1]:
cd ../..
In [3]:
import csv
In [4]:
f = open("datasource.abundance.tab", "w")
c = csv.writer(f, delimiter="\t")
# just the abundance feature: source CSV, output database and an options string
# (the option names indicate ignoreheader=1 skips the CSV header row and
# zeromissinginternal=1 zero-fills missing values inside the source data)
c.writerow(["forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv",
            "forGAVIN/pulldown_data/dataset/abundance.Entrez.db",
            "ignoreheader=1;zeromissinginternal=1"])
f.close()
In [6]:
import sys
In [7]:
sys.path.append("opencast-bio/")
In [8]:
import ocbio.extract
In [10]:
!git annex unlock forGAVIN/pulldown_data/dataset/abundance.Entrez.db
In [11]:
assembler = ocbio.extract.FeatureVectorAssembler("datasource.abundance.tab",verbose=True)
In [14]:
assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
                   "features/pulldown.interactions.interpolate.abundance.targets.txt",
                   verbose=True, missinglabel="0")
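Here missinglabel="0" presumably writes a zero for any protein pair with no measured abundance, consistent with the zeromissinginternal=1 option in the data source table above.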
In [15]:
from numpy import loadtxt, mean
y = loadtxt("features/pulldown.interactions.interpolate.abundance.targets.txt")
In [30]:
# mean over feature values greater than one (missing values were written as zero)
print "Average value of abundance feature: {0}".format(mean(y[y>1]))
In [17]:
import pickle
In [43]:
f = open("forGAVIN/pulldown_data/dataset/abundance.average.pickle", "wb")
# store the mean over all values (including the zero-filled missing entries)
pickle.dump([mean(y)], f)
f.close()
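A minimal sketch of how the pickled average could later be substituted for missing values; the real substitution happens inside the feature assembly code, and the "missing" placeholder below is purely illustrative:

import pickle

# reload the stored average
f = open("forGAVIN/pulldown_data/dataset/abundance.average.pickle", "rb")
averages = pickle.load(f)
f.close()

# hypothetical feature row with one missing entry
row = ["missing"]
filled = [averages[0] if v == "missing" else float(v) for v in row]
print filled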
To check that linear regression isn't going to work any better on this dataset than it did in the affinity notebook, we should try to fit a linear regression model in this case as well:
In [34]:
#loading X vectors
X = loadtxt("features/pulldown.interactions.interpolate.vectors.txt")
In [35]:
import sklearn.utils
In [36]:
X,y = sklearn.utils.shuffle(X,y)
In [37]:
import sklearn.cross_validation
In [38]:
kf = sklearn.cross_validation.KFold(y.shape[0],10)
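(In this version of scikit-learn, KFold takes the total number of samples and the number of folds; later releases moved it to sklearn.model_selection with a different constructor signature.)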
In [40]:
import sklearn.linear_model
In [41]:
scores = []
for train, test in kf:
    # split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    # train the regression model
    linreg = sklearn.linear_model.LinearRegression()
    linreg.fit(X_train, y_train)
    # test it on the held-out fold
    scores.append(linreg.score(X_test, y_test))
Got tired of waiting and interrupted the loop before all ten folds completed.
In [42]:
print scores
Looks like there's very little advantage to a linear regression model here.
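(LinearRegression.score reports the coefficient of determination R^2 on each held-out fold, so scores close to zero mean the model explains almost none of the variance in the targets.)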